Text segmentation in web images using colour perception and topological features
نویسنده
چکیده
The research presented in this thesis addresses the problem of Text Segmentation in Web images. Text is routinely created in image form (headers, banners etc.) on Web pages, as an attempt to overcome the stylistic limitations of HTML. This text however, has a potentially high semantic value in terms of indexing and searching for the corresponding Web pages. As current search engine technology does not allow for text extraction and recognition in images, the text in image form is ignored. Moreover, it is desirable to obtain a uniform representation of all visible text of a Web page (for applications such as voice browsing or automated content analysis). This thesis presents two methods for text segmentation in Web images using colour perception and topological features. The nature of Web images and the implicit problems to text segmentation are described, and a study is performed to assess the magnitude of the problem and establish the need for automated text segmentation methods. Two segmentation methods are subsequently presented: the Split-and-Merge segmentation method and the Fuzzy segmentation method. Although approached in a distinctly different way in each method, the safe assumption that a human being should be able to read the text in any given Web Image is the foundation of both methods’ reasoning. This anthropocentric character of the methods along with the use of topological features of connected components, comprise the underlying working principles of the methods. An approach for classifying the connected components resulting from the segmentation methods as either characters or parts of the background is also presented.
منابع مشابه
Two Approaches for Text Segmentation in Web Images
There is a significant need to recognise the text in images on web pages, both for effective indexing and for presentation by non-visual means (e.g., audio). This paper presents and compares two novel methods for the segmentation of characters for subsequent extraction and recognition. The novelty of both approaches is the combination of (different in each case) topological features of characte...
متن کاملColour text segmentation in web images based on human perception
There is a significant need to extract and analyse the text in images on Web documents, for effective indexing, semantic analysis and even presentation by non-visual means (e.g., audio). This paper argues that the challenging segmentation stage for such images benefits from a human perspective of colour perception in preference to RGB colour space analysis. The proposed approach enables the seg...
متن کاملFuzzy Segmentation of Characters in Web Images Based on Human Colour Perception
This paper describes a new approach for the segmentation of characters in images on Web pages. In common with the authors’ previous work in this subject, this approach attempts to emulate the ability of humans to differentiate between colours. In this case, pixels of similar colour are first grouped using a colour distance defined in a perceptually uniform colour space (as opposed to the common...
متن کاملText Extraction from Web Images Based on Human Perception and Fuzzy Inference
There is a significant need to extract and recognise the semantically-important text contained in images on Web pages. This paper proposes a new approach to text extraction from this special class of images. The method attempts to emulate closer than before the way humans perceive colour differences in order to differentiate between text and background regions. Pixels of similar colour (as huma...
متن کاملDocument Analysis And Classification Based On Passing Window
In this paper we present Document analysis and classification system to segment and classify contents of Arabic document images. This system includes preprocessing, document segmentation, feature extraction and document classification. A document image is enhanced in the preprocessing by removing noise, binarization, and detecting and correcting image skew. In document segmentation, an algorith...
متن کامل